Supervised nonnegative matrix factorization

ABSTRACT

Supervised nonnegative matrix factorization (SNMF) generates a descriptive part-based representation of data, based on the concept of nonnegative matrix factorization (NMF) aided by the discriminative concept of graph embedding. An iterative procedure that optimizes the suggested formulation based on Pareto optimization is presented. The present formulation removes any dependence on combined optimization schemes. Analytical and empirical evidence is presented to show that SNMF has advantages over popular subspace learning techniques as well as current state-of-the-art techniques.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 12/854,781 entitled “Supervised Nonnegative Matrix Factorization” filed on the same day as the instant application, and U.S. patent application Ser. No. 12/854,776 entitled “Supervised Nonnegative Matrix Factorization” filed on the same day as the instant application. These related applications are hereby incorporated by reference for all purposes.

FIELD OF INVENTION

The present invention relates to the field of matrix factorization. More specifically, it relates to the field of matrix factorization with incorporated data classification properties.

DESCRIPTION OF RELATED ART

Nonnegative matrix factorization (NMF) has recently been used for various applications, such as face recognition, multimedia, text mining, and gene expression discovery. NMF is a part-based representation wherein nonnegative inputs are represented by additive combinations of nonnegative bases. The inherent nonnegativity constraint in NMF leads to improved physical interpretation compared to other factorization methods, such as Principal Component Analysis (PCA). Psychological and physiological evidence for part-based representations in the brain appears to support the advantage of NMF.

Interestingly, NMF does not explain the recognition capability of the brain, since NMF and its variants have been designed not for classification but for reconstruction. The lack of classification capability is a natural consequence of the unsupervised factorization method, which does not utilize relationships within the input entities, such as class labels.

Several approaches have been proposed for NMF to generate more descriptive features for classification and clustering tasks. For example, “Fisher Nonnegative Matrix Factorization”, ACCV, 2004, by Y. Wang, Y. Jia, C. Hu, and M. Turk, proposes incorporating the NMF cost function and the difference of the between-class scatter from the within-class scatter. However, the objective of this Fisher-NMF is not guaranteed to converge since it may not be a convex function. “Non-negative Matrix Factorization on Manifold”, ICDM, 2008, by D. Cai, X. He, X. Wu, and J. Han proposes graph regularized NMF (GNMF), which appends a term representing the favorable relationships among feature vector pairs. GNMF, however, is handicapped by not considering unfavorable relationships.

Recently, J. Yang, S. Yang, Y. Fu, X. Li, and T. Huang proposed “Non-negative graph embedding” (NGE), in CVPR, 2008. NGE resolved the previous problems by introducing the concept of complementary space, and as a result is widely considered the state-of-the-art. NGE, however, utilized an approximate formulation of graph embedding, and as a result, NGE is not effective enough for classification, particularly when intra-class variations are large. This limitation is highlighted in the experimental results shown below.

In a general sense, all of these previous works tried to incorporate NMF with graph embedding, but none of them successfully adopted the original formulation of graph embedding because the incorporated optimization problem is intractable. In addition, all of these works are limited in that they depend on suitable parameters that are not easy to determine appropriately.

SUMMARY OF INVENTION

It is an object of the present invention to incorporate NMF with graph embedding using the original formulation of graph embedding.

It is another object of the present invention to permit the use of negative values in the definition of graph embedding without violating the requirement of NMF to limit itself to nonnegative values.

The above objects are met in a matrix factorization method, a method for applying the factorization method to test data classification, and a system for classifying test data using the matrix factorization method.

A preferred embodiment of the present invention is a method of factorizing a data matrix U file by supervised nonnegative factorization, SNMF, comprising: providing a data processing device to implement the following steps: accessing the data matrix U from a data store, wherein data matrix U is defined as U∈ℝ^(d×n); defining an intrinsic graph G, wherein G = {U, W}, each column of U∈ℝ^(d×n) represents a vertex, and each element of similarity matrix W measures the similarity between vertex pairs; defining a penalty graph Ḡ, wherein Ḡ = {U, W̄} and each element of dissimilarity matrix W̄ measures unfavorable relationships between the vertex pairs; defining an intrinsic diagonal matrix D, wherein D = [D_(ij)] and D_(ii) = Σ_(j=1)^(n) W_(ij); defining an intrinsic Laplacian matrix L, wherein L = D−W; defining a penalty diagonal matrix D̄, wherein D̄ = [D̄_(ij)] and D̄_(ii) = Σ_(j=1)^(n) W̄_(ij); defining a penalty Laplacian matrix L̄, wherein L̄ = D̄−W̄; defining a basis matrix V, where V∈ℝ^(d×r); defining a feature matrix X, where X∈ℝ^(r×n); defining a measure of the compactness of intrinsic graph G by the weighted sum of squared distances defined as Σ_(i<j)^(n) W_(ij)∥x_(i)−x_(j)∥² = Tr(XLX^(T)), wherein x_(i) is the i-th column of X and x_(j) is the j-th column of X; defining a measure of the separability of penalty graph Ḡ by the weighted sum of squared distances defined as Σ_(i<j)^(n) W̄_(ij)∥x_(i)−x_(j)∥² = Tr(XL̄X^(T)); defining F⁽¹⁾(V, X) as an objective of NMF (nonnegative matrix factorization), wherein F⁽¹⁾(V, X) = ∥U−VX∥_(F)²; defining F⁽²⁾(X) as an objective of graph embedding, where

${{F^{(2)}(X)} = \frac{{Tr}\left( {XLX}^{T} \right)}{{Tr}\left( {X\overset{\_}{L}X^{T}} \right)}};$

applying Pareto optimality to F⁽¹⁾(V, X) and F⁽²⁾(X); and defining the final state V*, X* of matrices V and X at the Pareto optimal resulting from application of the Pareto optimality as a factorization of data matrix U.

Preferably, data matrix U is comprised of n samples and each column of U represents a sample. Optionally, each of the samples may be an image file.

Also in this method, W and W̄ are generated from true relationships among data pairs. These data pairs may be class labels of the data.

This method defines each column of feature matrix X as a low dimensional representation of the corresponding column of matrix U.

Preferably, the ratio formulation of F⁽²⁾(X) is handled without any transformation. Additionally, at least one of similarity matrix W or dissimilarity matrix W̄ has negative values. It may be noted, however, that Tr(XLX^(T)) and Tr(XL̄X^(T)) are positive.

In this method, the Pareto optimality is applied directly on the ratio formulation of F⁽²⁾(X) in the absence of any weighted sum approximation.

Preferably, the Pareto optimality is applied through a series of Pareto improvement status update iterations, each defined as a change from a current status (V, X) to a new status (V′, X′) that achieves a Pareto improvement, until the Pareto optimal is achieved. A status update is a Pareto improvement if either of the following two conditions is satisfied:

F⁽¹⁾(V′,X′) < F⁽¹⁾(V,X) and F⁽²⁾(V′,X′) ≦ F⁽²⁾(V,X)  1)

F⁽¹⁾(V′,X′) ≦ F⁽¹⁾(V,X) and F⁽²⁾(V′,X′) < F⁽²⁾(V,X)  2)

Additionally, a current status is a Pareto optimal (V*, X*) if there is no other status (V′, X′) such that a status update iteration from (V*, X*) to (V′, X′) is a Pareto improvement.

In a more specific implementation of this method, the Pareto optimality is applied to F⁽¹⁾(V, X) and to F⁽²⁾(X) through iterative multiplicative updates until the Pareto optimal is achieved. This is done by letting λ = F⁽²⁾(X); letting

${T_{ij}^{(1)} = \frac{\left( {V^{T}U} \right)_{ij}}{\left( {V^{T}{VX}} \right)_{ij}}};$

letting

${T_{ij}^{(2)} = \frac{\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)}^{-} \right)_{ij}}{\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}};$

letting matrix (L−λL̄)⁺ be defined as (L−λL̄)⁺ = A⁺ = [A_(ij)⁺] and

$A_{ij}^{+} = \left\{ \begin{matrix}A_{ij} & {{{if}\mspace{14mu} A_{ij}} > 0} \\0 & {{otherwise};}\end{matrix} \right.$

and letting matrix (L−λL̄)⁻ be defined as (L−λL̄)⁻ = A⁻ = [A_(ij)⁻] and

$A_{ij}^{-} = \left\{ \begin{matrix}{- A_{ij}} & {{{if}\mspace{14mu} A_{ij}} < 0} \\0 & {{otherwise}.}\end{matrix} \right.$

Under these conditions, the multiplicative updates are:

$\left. V_{ij}\leftarrow{V_{ij}\frac{\left( {UX}^{T} \right)_{ij}}{\left( {VXX}^{T} \right)_{ij}}} \right.$

and X_(ij) ← X_(ij)·h(T_(ij)⁽¹⁾, T_(ij)⁽²⁾), where

${h\left( {a,b} \right)} = \left\{ \begin{matrix}{\min\left( {a,b} \right)} & {{{if}\mspace{14mu} a} > {1\mspace{14mu}{and}\mspace{14mu} b} > 1} \\{\max\left( {a,b} \right)} & {{{if}\mspace{14mu} a} < {1\mspace{14mu}{and}\mspace{14mu} b} < 1} \\1 & {{otherwise}.}\end{matrix} \right.$

The Pareto optimal is then achieved when the multiplicative updates reach a stationary point.

In this method, the similarity matrix W and dissimilarity matrix W̄ may be defined by the concept of within-class and between-class distances of Linear Discriminant Analysis (LDA). In this case, similarity matrix W = [W_(ij)] is defined as:

$W_{ij} = \left\{ \begin{matrix}\frac{1}{n_{c}} & {{{if}\mspace{14mu} y_{i}},{y_{j} \in c}} \\0 & {otherwise}\end{matrix} \right.$

wherein y_(i) is a class label of the i-th sample, y_(j) is a class label of the j-th sample, and n_(c) is the size of class c. Similarly, dissimilarity matrix W̄ = [W̄_(ij)] is defined as

${\overset{\_}{W}}_{ij} = {\frac{1}{n} - W_{ij}}$

wherein n is the total number of data points.

This method may be applied to a method of classifying test data, comprising: arranging a set of training data into a data matrix U; applying the above-described supervised nonnegative factorization method to data matrix U to identify the Pareto optimal state V* and X* of factorizing matrices V and X; and classifying the test data according only to the classification defined by X*.

The present objects are also met in a data classification system for classifying test data, comprising: a data processing device with access to a data matrix U of training data and with access to the test data, the data matrix U being defined as U∈ℝ^(d×n); wherein an intrinsic graph G is defined as G = {U, W}, each column of U∈ℝ^(d×n) representing a vertex and each element of similarity matrix W measuring the similarity between vertex pairs; a penalty graph Ḡ is defined as Ḡ = {U, W̄} and each element of dissimilarity matrix W̄ measures unfavorable relationships between the vertex pairs; an intrinsic diagonal matrix D is defined as D = [D_(ij)] and D_(ii) = Σ_(j=1)^(n) W_(ij); an intrinsic Laplacian matrix L is defined as L = D−W; a penalty diagonal matrix D̄ is defined as D̄ = [D̄_(ij)] and D̄_(ii) = Σ_(j=1)^(n) W̄_(ij); a penalty Laplacian matrix L̄ is defined as L̄ = D̄−W̄; a basis matrix V is defined as V∈ℝ^(d×r); a feature matrix X is defined as X∈ℝ^(r×n); a measure of the compactness of intrinsic graph G is defined by the weighted sum of squared distances Σ_(i<j)^(n) W_(ij)∥x_(i)−x_(j)∥² = Tr(XLX^(T)), wherein x_(i) is the i-th column of X and x_(j) is the j-th column of X; a measure of the separability of penalty graph Ḡ is defined by the weighted sum of squared distances Σ_(i<j)^(n) W̄_(ij)∥x_(i)−x_(j)∥² = Tr(XL̄X^(T)); F⁽¹⁾(V, X) defines an objective of NMF (nonnegative matrix factorization), wherein F⁽¹⁾(V, X) = ∥U−VX∥_(F)²; F⁽²⁾(X) defines an objective of graph embedding, where

${{F^{(2)}(X)} = \frac{{Tr}\left( {XLX}^{T} \right)}{{Tr}\left( {X\overset{\_}{L}X^{T}} \right)}};$

and the data processing device applies Pareto optimality to F⁽¹⁾(V, X) and F⁽²⁾(X), defines the final state V*, X* of matrices V and X at the Pareto optimal resulting from application of the Pareto optimality as a factorization of data matrix U, and classifies the test data according only to the classification defined by X*.

Preferably, data matrix U is comprised of n samples and each column of U represents a sample. Each of the samples may be an image file. Also, the data pairs are preferably class labels of data.

Further preferably, each column of feature matrix X is a low dimensional representation of the corresponding column of U.

Additionally, the ratio formulation of F⁽²⁾(X) may be handled without any transformation. It is further noted that at least one of similarity matrix W or dissimilarity matrix W̄ may have negative values, and the Pareto optimality may be applied directly on the ratio formulation of F⁽²⁾(X) in the absence of any weighted sum approximation.

Preferably, the Pareto optimality is applied through a series of Pareto improvement status update iterations, each defined as a change from a current status (V, X) to a new status (V′, X′) that achieves a Pareto improvement, until the Pareto optimal is achieved, and a status update is a Pareto improvement if either of the following two conditions is satisfied:

F⁽¹⁾(V′,X′) < F⁽¹⁾(V,X) and F⁽²⁾(V′,X′) ≦ F⁽²⁾(V,X)  1)

F⁽¹⁾(V′,X′) ≦ F⁽¹⁾(V,X) and F⁽²⁾(V′,X′) < F⁽²⁾(V,X)  2)

A current status is a Pareto optimal (V*, X*) if there is no other status (V′, X′) such that a status update iteration from (V*, X*) to (V′, X′) is a Pareto improvement.

In a more specific example, Pareto optimality is applied to F⁽¹⁾(V, X) and to F⁽²⁾(X) through iterative multiplicative updates until the Pareto optimal is achieved. This is achieved by first letting λ = F⁽²⁾(X); letting

${T_{ij}^{(1)} = \frac{\left( {V^{T}U} \right)_{ij}}{\left( {V^{T}{VX}} \right)_{ij}}};$

letting

${T_{ij}^{(2)} = \frac{\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)}^{-} \right)_{ij}}{\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}};$

letting matrix (L−λL̄)⁺ be defined as (L−λL̄)⁺ = A⁺ = [A_(ij)⁺] and

$A_{ij}^{+} = \left\{ \begin{matrix}A_{ij} & {{{if}\mspace{14mu} A_{ij}} > 0} \\0 & {{otherwise};}\end{matrix} \right.$

letting matrix (L−λL̄)⁻ be defined as (L−λL̄)⁻ = A⁻ = [A_(ij)⁻] and

$A_{ij}^{-} = \left\{ \begin{matrix}{- A_{ij}} & {{{if}\mspace{14mu} A_{ij}} < 0} \\0 & {{otherwise}.}\end{matrix} \right.$

The multiplicative updates may then be defined as:

$\left. V_{ij}\leftarrow{V_{ij}\frac{\left( {UX}^{T} \right)_{ij}}{\left( {VXX}^{T} \right)_{ij}}} \right.$

and X_(ij) ← X_(ij)·h(T_(ij)⁽¹⁾, T_(ij)⁽²⁾), where

${h\left( {a,b} \right)} = \left\{ \begin{matrix}{\min\left( {a,b} \right)} & {{{if}\mspace{14mu} a} > {1\mspace{14mu}{and}\mspace{14mu} b} > 1} \\{\max\left( {a,b} \right)} & {{{if}\mspace{14mu} a} < {1\mspace{14mu}{and}\mspace{14mu} b} < 1} \\1 & {{otherwise}.}\end{matrix} \right.$

In this case, the Pareto optimal is achieved when the multiplicative updates reach a stationary point.

Also in this system, similarity matrix W = [W_(ij)] is defined as:

$W_{ij} = \left\{ \begin{matrix}\frac{1}{n_{c}} & {{{if}\mspace{14mu} y_{i}},{y_{j} \in c}} \\0 & {otherwise}\end{matrix} \right.$

wherein y_(i) is a class label of the i-th sample, y_(j) is a class label of the j-th sample, and n_(c) is the size of class c; and dissimilarity matrix W̄ = [W̄_(ij)] is defined as

${\overset{\_}{W}}_{ij} = {\frac{1}{n} - W_{ij}}$

wherein n is the total number of data points.

Other objects and attainments together with a fuller understanding of the invention will become apparent and appreciated by referring to the following description and claims taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings wherein like reference symbols refer to like parts.

FIG. 1 is a flowchart of a preferred SNMF method in accord with the present invention.

FIG. 2 illustrates multiplicative updates for application of Pareto optimization in accord with FIG. 1.

FIG. 3 shows exemplary hardware for implementing the present invention.

FIG. 4 shows Table 1, summarizing a comparison of the recognition properties of the present invention (SNMF) versus various other recognition methods known in the art.

FIG. 5 shows examples of three sizes of synthetic occlusions placed on a reference image.

FIG. 6A compares the results of the present invention versus various known recognition methods applied to the FERET database with the synthetic occlusions of FIG. 5.

FIG. 6B compares the results of the present invention versus various known recognition methods applied to the JAFFE database with the synthetic occlusions of FIG. 5.

FIG. 7 illustrates five sample images with natural variations provided by the AR database.

FIG. 8 shows Table 2, which compares the present invention to various recognition methods known in the art when applied to the AR database.

FIG. 9 shows the basis images of SNMF (61), NMF (62), LNMF (63), and NGE (64) for the AR database.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Recently, Nonnegative Matrix Factorization (NMF) has received much attention due to its representative power for nonnegative data. The discriminative power of NMF, however, is limited by its inability to consider relationships present in data, such as class labels. Several works tried to address this issue by adopting the concept of graph embedding, albeit in an approximated form. In this paper, we propose Supervised NMF (SNMF) that incorporates the objective function of graph embedding without any approximations, which was previously deemed to be intractable. We present a novel way of considering the objective functions of NMF and graph embedding separately and finding a Pareto optimum through iteratively searching for Pareto improvements. As a result, SNMF achieves higher classification performance since it does not compromise the full potential of graph embedding. Furthermore, SNMF is designed for computational efficiency, no parameter tuning, formulation flexibility, and applicability to NMF variants. Empirical evidence demonstrates the success of SNMF in terms of robustness to variations as well as recognition accuracy compared to state-of-the-art techniques.

The present work proposes an approach herein named supervised nonnegative matrix factorization (SNMF), which is intended to achieve classification capability based on the benefits of the part-based representation of NMF. First of all, SNMF adopts the original formulation of graph embedding and solves a system of optimization problems through iteratively searching for a Pareto improvement. Generally, Pareto optimization is a concept in economics and game theory, but it is herein applied to more effectively incorporate graph embedding concepts into NMF.

Due to the effectiveness of the original formulation, SNMF shows better performance than previous works for classification. Aside from its classification power, SNMF is designed to have several advantages over previous works, including computational efficiency, no parameter tuning, formulation flexibility, and applicability to NMF variants. SNMF also outperforms other subspace learning methods in terms of recognition accuracy and robustness to variations, especially when intra-class variations are not trivial.

With reference to FIG. 1, SNMF combines the benefits of non-negative matrix factorization (NMF) and graph embedding, each of which is discussed in turn.

Before describing SNMF, it is beneficial to first provide background information regarding non-negative matrix factorization (NMF) and graph embedding. For completeness, a brief introduction to non-negative graph embedding (NGE) is also provided since it is currently considered a state-of-the-art approach.

Like NMF, SNMF factorizes a matrix U into the product of two, preferably smaller, matrices: a basis matrix V (where V∈ℝ^(d×r)) and a coefficient matrix (or feature matrix) X (where X∈ℝ^(r×n)). For example, matrix U may be a raw data matrix of n samples (or data points) with each sample being of dimension d, such that U∈ℝ^(d×n) (step S1). A specific example of this may be if each of the n columns of U (i.e. each of the n samples) is an image of size d. Matrix U is factorized into the product of basis matrix V and feature matrix X by minimizing the following reconstruction error:

$\begin{matrix}{\min\limits_{V,X}\left\| {U - {VX}} \right\|_{F}^{2}\mspace{14mu} s.t.\mspace{14mu} V_{ik} \geq {0\mspace{14mu}{and}\mspace{14mu} X_{kj}} \geq {0\mspace{14mu}{\forall i}},j,\mspace{14mu}{{and}\mspace{14mu} k}} & (1)\end{matrix}$

where ∥·∥_(F) denotes the Frobenius norm. Since Eq. (1) is not a convex function of both V and X, there is no closed form solution for the global optimum. Thus, many researchers have developed iterative update methods to solve the problem. Among them, a popular approach is the multiplicative updates devised by Lee and Seung in “Learning the parts of objects by non-negative matrix factorization”, Nature, 401:788-791, 1999, which is hereby incorporated in its entirety by reference. These multiplicative updates, shown below as equation (2),

$\begin{matrix}{\left. V_{ij}\leftarrow{V_{ij}\frac{\left( {UX}^{T} \right)_{ij}}{\left( {VXX}^{T} \right)_{ij}}} \right.,\mspace{14mu}\left. X_{ij}\leftarrow{X_{ij}\frac{\left( {V^{T}U} \right)_{ij}}{\left( {V^{T}{VX}} \right)_{ij}}} \right.} & (2)\end{matrix}$

are popular due to their simplicity. These updates monotonically decrease the objective function in Eq. (1).
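As an illustration, the updates of Eq. (2) can be written in a few lines of NumPy. The following is a minimal sketch, not part of the patent disclosure; the function name and the small eps added to the denominators (a common safeguard, in the spirit of Lin's remedy discussed later) are assumptions.

```python
import numpy as np

def nmf_multiplicative(U, r, n_iter=200, eps=1e-12, seed=0):
    """Basic NMF via the Lee-Seung multiplicative updates of Eq. (2).

    U : (d, n) nonnegative data matrix; r : number of basis vectors.
    Returns V (d, r) and X (r, n) with U approximately equal to V @ X.
    """
    rng = np.random.default_rng(seed)
    d, n = U.shape
    V = rng.random((d, r))
    X = rng.random((r, n))
    for _ in range(n_iter):
        V *= (U @ X.T) / (V @ X @ X.T + eps)   # Eq. (2), V update
        X *= (V.T @ U) / (V.T @ V @ X + eps)   # Eq. (2), X update
    return V, X
```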

Graph embedding, on the other hand, may be defined as the optimal low dimensional representation that best characterizes the similarity relationships between data pairs. In graph embedding, dimensionality reduction involves two graphs: an intrinsic graph that characterizes the favorable relationships among feature vector pairs and a penalty graph that characterizes the unfavorable relationships among feature vector pairs. Thus, applying graph embedding to data matrix U would organize its raw data into classes according to specified favorable and unfavorable relationships. To achieve this, however, one first needs to define graph embedding as applied to data matrix U.

For graph embedding, one lets G = {U, W} be an intrinsic graph where each column of U∈ℝ^(d×n) represents a vertex and each element of W measures the similarity between vertex pairs (step S3). In the same way, a penalty graph Ḡ, which measures the unfavorable relationships between vertex pairs, may be defined as Ḡ = {U, W̄} (step S5). In this case, W and W̄ can be generated from true relationships among data pairs, such as class labels of data. In addition, the diagonal matrix D = [D_(ij)] is defined, where D_(ii) = Σ_(j=1)^(n) W_(ij) (step S7), and the Laplacian matrix L = D−W is defined (step S9). Matrices D̄ and L̄ are defined from W̄ in the same way (steps S11 and S13).
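For concreteness, the construction of steps S7 through S13 can be sketched as follows; this is an illustrative rendering only, and the function name is an assumption.

```python
import numpy as np

def graph_laplacian(W):
    """Return (D, L) for a (dis)similarity matrix W, per steps S7-S13.

    D is diagonal with D_ii = sum_j W_ij, and L = D - W.
    Applied to W it yields (D, L); applied to W_bar it yields
    (D_bar, L_bar) in exactly the same way.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W
    return D, L
```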

As is explained above, to factorize data matrix U, which is defined as U∈ℝ^(d×n), one defines a basis matrix V such that V∈ℝ^(d×r) (step S15), defines a feature matrix X such that X∈ℝ^(r×n) (step S17), and seeks to populate V and X such that the product of V and X approximates U with minimal error. An object of the present invention, however, is to combine graph embedding with the factorization of matrix U such that the classification properties of graph embedding are incorporated into factorized basis matrix V and feature matrix X. The present embodiment achieves this by defining the objective of graph embedding in terms of feature matrix X.

First, let each column of feature matrix X be a low dimensional representation of the corresponding column of U. Then, one can measure the compactness of the intrinsic graph G and the separability of the penalty graph Ḡ by the weighted sums of squared distances of feature matrix X, as follows:

Σ_(i<j)^(n) W_(ij) ∥x_(i)−x_(j)∥² = Tr(XLX^(T))  (Step S19)

Σ_(i<j)^(n) W̄_(ij) ∥x_(i)−x_(j)∥² = Tr(XL̄X^(T))  (Step S21)

where x_(i) is the i-th column of X and x_(j) is the j-th column of X.

The objective of graph embedding, as is the case for most dimensionality reduction methods, can be generalized to the following unified framework with specifically defined W and W̄:

$\begin{matrix}{\min\frac{{Tr}\left( {XLX}^{T} \right)}{{Tr}\left( {X\overset{\_}{L}X^{T}} \right)}} & (5)\end{matrix}$
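The graph embedding objective of Eq. (5), reused later as F⁽²⁾(X) in Eq. (8), is a simple trace ratio; a sketch of its evaluation follows (the function name is illustrative):

```python
import numpy as np

def f2(X, L, L_bar):
    """Graph embedding objective of Eq. (5)/(8):
    Tr(X L X^T) / Tr(X L_bar X^T)."""
    return np.trace(X @ L @ X.T) / np.trace(X @ L_bar @ X.T)
```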

As an aside, and before continuing with the discussion of the presently preferred embodiment, it may be useful to address how the prior work of Nonnegative Graph Embedding (NGE) attempts to combine the NMF objective and the graph embedding objective to achieve both benefits. First, it must be noted that NGE does not adopt the original graph embedding objective of Eq. (5) because it makes NGE's unified optimization problem intractable. Instead, NGE adopts the following transformed formulation.

Suppose that feature matrix X and basis matrix V are divided into two parts as

$X = \begin{bmatrix}X_{1} \\X_{2}\end{bmatrix}$

and V = [V₁ V₂]. Here, NGE considers (X₂, V₂) the complementary space of (X₁, V₁). Then NGE minimizes

$\begin{matrix}{\min\limits_{V,X}\;{{Tr}\left( {Q_{1}X_{1}{LX}_{1}^{T}Q_{1}^{T}} \right)} + {{Tr}\left( {Q_{2}X_{2}\overset{\_}{L}X_{2}^{T}Q_{2}^{T}} \right)} + {\lambda\left\| {U - {VX}} \right\|_{F}^{2}}\mspace{14mu}{{s.t.\mspace{14mu} V} \geq {0\mspace{14mu}{and}\mspace{14mu} X} \geq 0}} & (6)\end{matrix}$

where Q₁ and Q₂ are diagonal matrices that consist of the norms of the basis vectors in V₁ and V₂, respectively. These two matrices are multiplied in to compensate for the norms of the bases in the coefficient matrices.

Yang et al. argued that NGE minimizes the penalty graph term rather than maximizing it due to the complementary property between (X₁, V₁) and (X₂, V₂). However, it is doubtful whether the complementary space exists without violating the nonnegativity constraints. Even if the space exists, NGE provides no assurance that it can discover the complementary space. In fact, experimental results, detailed below, show that NGE does not sufficiently maximize separability of the penalty graph.

Returning now to the subject of the present invention: supervised nonnegative matrix factorization (SNMF). The mathematical details of the presently proposed SNMF approach are first formalized. It is put forth that the present method generates more descriptive features based on the part-based representation of NMF and has several advantages over previous works that were based on an idea similar to that of SNMF. The experimental results, shown below, substantiate this claim.

To acquire both the benefits of part-based representation and the classification power of graph embedding, the present invention addresses both the objective of NMF and the objective of graph embedding. However, unlike previous works, the present invention directly handles the ratio formulation of graph embedding without any transformation. Although it was previously believed that the ratio formulation of graph embedding made its use in NMF intractable, the present approach makes the problem manageable by taking the two objectives into account separately, rather than attempting to manage a weighted sum of them, as is generally proposed in previous works. The two objectives that are considered are:

F⁽¹⁾(V,X) = ∥U−VX∥_(F)²  (7)

which is based on the nonnegative matrix factorization objective of Eq. (1) (step S23), and

$\begin{matrix}{{F^{(2)}(X)} = \frac{{Tr}\left( {XLX}^{T} \right)}{{Tr}\left( {X\overset{\_}{L}X^{T}} \right)}} & (8)\end{matrix}$

which is based on the original graph embedding objective of Eq. (5) (step S25).

Before describing the present method of optimizing the objectives of Eqs. (7) and (8), it is helpful to first introduce the concept of Pareto optimality of multiple objectives. Pareto optimality is an important concept in the field of economics, and has been widely used in game theory and the social sciences. A Pareto improvement is defined as a change in the status of a system from one status to another that can improve at least one objective without worsening any other objective. The Pareto optimal (or Pareto minimum) is defined as the status where no further Pareto improvements can be obtained. Based on these general definitions, the present approach first provides two Pareto optimal definitions (or equivalently, Pareto minimum definitions) of the objectives in Eqs. (7) and (8).

-   Definition 1: A status update in basis matrix V and feature matrix X [i.e. a change from a current status (V, X) to a new status (V′, X′)] is a Pareto improvement if either of the following two conditions is satisfied:

    F⁽¹⁾(V′,X′) < F⁽¹⁾(V,X) and F⁽²⁾(V′,X′) ≦ F⁽²⁾(V,X)  1)

    F⁽¹⁾(V′,X′) ≦ F⁽¹⁾(V,X) and F⁽²⁾(V′,X′) < F⁽²⁾(V,X)  2)

-   Definition 2: A current status (V*, X*) is a Pareto minimum (i.e. Pareto optimal) if there is no other status (V′, X′) such that a status change from (V*, X*) to (V′, X′) is a Pareto improvement.
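Definition 1 transcribes directly into code; the following is a minimal sketch with assumed names, where the four arguments are the values of Eqs. (7) and (8) before and after a candidate status update:

```python
def is_pareto_improvement(f1_old, f2_old, f1_new, f2_new):
    """Definition 1: the move is a Pareto improvement if one objective
    strictly decreases while the other does not increase."""
    cond1 = f1_new < f1_old and f2_new <= f2_old
    cond2 = f1_new <= f1_old and f2_new < f2_old
    return cond1 or cond2
```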

SNMF finds a Pareto minimum of the two objectives (as defined by Eqs. (7) and (8)) (step S29) by achieving a Pareto improvement in every iteration (step S27). This approach is essentially different from previous approaches, which minimize a weighted sum of the two objectives. The present SNMF approach has several advantages over previous works, especially over NGE.

First, SNMF achieves higher descriptive power due to the effectiveness of the original ratio formulation of graph embedding, as shown in Eq. (8), over a weighted sum approximation, as taught by previous works. This formulation allows SNMF to operate without any additional concepts, such as the concept of complementary space utilized in NGE.

Second, SNMF need not sacrifice the NMF reconstruction error to improve the graph embedding objective, or sacrifice the graph embedding objective to improve the NMF reconstruction error, because a Pareto improvement guarantees that both are non-increasing in every iteration. On the other hand, in previous works, it is not guaranteed that both these terms are consistently non-increasing because only their weighted sum is considered.

Third, the suggested optimization problem can be solved through simple multiplicative iterative updates, as is shown below. Thus, SNMF has a significantly lower computational cost as compared to NGE. It may be noted that another approach, known as multiplicative NGE, also attempts to reduce the computational cost of NGE, but even compared with multiplicative NGE, the computational cost of SNMF is more economical since SNMF's multiplicative factors are simpler than those of multiplicative NGE. Furthermore, unlike multiplicative NGE, SNMF need not compensate for the norms of the bases in the coefficient matrix because the ratio formulation automatically prevents the objectives from trivially decreasing when the bases are rescaled.

Fourth, SNMF does not require any parameters since the two objectives of Eqs. (7) and (8) are considered separately. On the other hand, all previous works introduced parameters that must be tuned heuristically or empirically, which is cumbersome and in some cases difficult to determine appropriately even by empirical validation. For example, NGE requires one to determine two parameters: 1) a weight parameter for balancing the NMF objective and the graph embedding objective; and 2) a size parameter for dividing the basis and coefficient matrices into two parts.

Lastly, SNMF can employ any definitions of similarity and dissimilarity matrices W and W̄ (including ones with negative values) as long as both Tr(XLX^(T)) and Tr(XL̄X^(T)) are positive. These constraints are reasonable since Tr(XLX^(T)) and Tr(XL̄X^(T)) are distance measures. By contrast, NGE requires more restrictive constraints when defining the matrices. For example, in NGE, all the elements of W and W̄ must be nonnegative because negative elements can make the objective of NGE a non-convex function.

Before describing a detailed implementation of SNMF, a sample definition of W and W̄ is provided. A presently preferred embodiment defines W and W̄ by borrowing the concept of within-class and between-class distances of Linear Discriminant Analysis (LDA), as is generally described, for example, in chapter 5 of the book “Pattern Classification” by R. O. Duda, P. E. Hart, and D. G. Stork, published by Wiley-Interscience, Hoboken, N.J., 2nd edition, 2001, which is hereby incorporated by reference. This approach begins by letting y_(i) be the class label of the i-th sample and n_(c) be the size of class c. Matrices W = [W_(ij)] and W̄ = [W̄_(ij)] are then defined as

$W_{ij} = \left\{ {{\begin{matrix}\frac{1}{n_{c}} & {{{if}\mspace{14mu} y_{i}},{y_{j} \in c}} \\0 & {otherwise}\end{matrix}\mspace{14mu}{and}\mspace{14mu}{\overset{\_}{W}}_{ij}} = {\frac{1}{n} - W_{ij}}} \right.$

where n is the total number of data points.
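Under this LDA-style definition, W and W̄ follow directly from the label vector. The sketch below is illustrative only (the function name is an assumption); note that W̄ can contain negative entries, as the next paragraph points out.

```python
import numpy as np

def lda_graphs(labels):
    """Build W and W_bar from class labels per the LDA-style definition:
    W_ij = 1/n_c if samples i and j share class c (of size n_c), else 0;
    W_bar_ij = 1/n - W_ij, where n is the number of samples."""
    y = np.asarray(labels)
    n = y.size
    same = (y[:, None] == y[None, :])                     # same-class mask
    class_sizes = np.array([np.sum(y == c) for c in y])   # n_c per sample
    W = np.where(same, 1.0 / class_sizes[:, None], 0.0)
    W_bar = 1.0 / n - W
    return W, W_bar
```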

Note that the elements of W̄ can be negative, which means that NGE cannot use W and W̄ from the LDA formulation, as described immediately above. Not only can SNMF adopt the LDA formulation in order to measure similarities, but other formulations can be adopted as well. For example, for multi-modal data sets, the Marginal Fisher Analysis (MFA) formulation, which effectively reflects local relationships among data, can be used. Information on MFA is provided in “Marginal Fisher Analysis and Its Variants For Human Gait Recognition and Content-based Image Retrieval”, IEEE Trans. on Image Processing, 16(11), 2007, by D. Xu, S. Yan, D. Tao, S. Lin, and H. J. Zhang, which is herein incorporated in its entirety.

A detailed implementation of SNMF now follows.

With reference to FIG. 2, the two SNMF objectives can be optimized by simple multiplicative updates. More formally, a Pareto improvement of F⁽¹⁾ and F⁽²⁾ (as defined by Eqs. (7) and (8)) can be obtained by the following multiplicative updates: let λ = F⁽²⁾(X) (step S31) and let

$\begin{matrix}\left. \begin{matrix}{T_{ij}^{(1)} = \frac{\left( {V^{T}U} \right)_{ij}}{\left( {V^{T}{VX}} \right)_{ij}}} \\{T_{ij}^{(2)} = \frac{\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)}^{-} \right)_{ij}}{\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}}\end{matrix} \right\} & (10)\end{matrix}$

(steps S33 and S35), where matrix (L−λL̄)⁺ = A⁺ = [A_(ij)⁺] (step S37) and matrix (L−λL̄)⁻ = A⁻ = [A_(ij)⁻] (step S39) are defined as

$\begin{matrix}\left. \begin{matrix}{A_{ij}^{+} = \left\{ \begin{matrix}A_{ij} & {{{if}\mspace{14mu} A_{ij}} > 0} \\0 & {otherwise}\end{matrix} \right.} \\{A_{ij}^{-} = \left\{ \begin{matrix}{- A_{ij}} & {{{if}\mspace{14mu} A_{ij}} < 0} \\0 & {otherwise}\end{matrix} \right.}\end{matrix} \right\} & (11)\end{matrix}$

A Pareto minimum (i.e. Pareto optimal) of F⁽¹⁾ and F⁽²⁾ can be obtained by applying the following multiplicative updates iteratively (step S41).

$\begin{matrix}\left. V_{ij}\leftarrow{V_{ij}\frac{\left( {UX}^{T} \right)_{ij}}{\left( {VXX}^{T} \right)_{ij}}} \right. & (12) \\\left. X_{ij}\leftarrow{X_{ij} \cdot {h\left( {T_{ij}^{(1)},T_{ij}^{(2)}} \right)}} \right. & (13)\end{matrix}$

where

$\begin{matrix}{{h\left( {a,b} \right)} = \left\{ \begin{matrix}{\min\left( {a,b} \right)} & {{{if}\mspace{14mu} a} > {1\mspace{14mu}{and}\mspace{14mu} b} > 1} \\{\max\left( {a,b} \right)} & {{{if}\mspace{14mu} a} < {1\mspace{14mu}{and}\mspace{14mu} b} < 1} \\1 & {otherwise}\end{matrix} \right.} & (14)\end{matrix}$
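Pulling Eqs. (10) to (14) together, one SNMF iteration can be rendered in a few lines of NumPy. This is a minimal illustrative sketch rather than the patent's reference implementation; the function name and the small eps guarding the denominators (in the spirit of Lin's remedy discussed below) are assumptions.

```python
import numpy as np

def snmf_step(U, V, X, L, L_bar, eps=1e-12):
    """One Pareto-improving SNMF iteration, Eqs. (10)-(14)."""
    # Eq. (12): standard NMF update for the basis matrix V.
    V = V * (U @ X.T) / (V @ X @ X.T + eps)
    # lambda = F2(X): current value of the trace ratio, per Eq. (39).
    lam = np.trace(X @ L @ X.T) / np.trace(X @ L_bar @ X.T)
    # Split L - lam * L_bar into positive and negative parts, Eq. (11).
    A = L - lam * L_bar
    A_pos, A_neg = np.maximum(A, 0.0), np.maximum(-A, 0.0)
    # Multiplicative factors of Eq. (10).
    T1 = (V.T @ U) / (V.T @ V @ X + eps)
    T2 = (X @ A_neg) / (X @ A_pos + eps)
    # Combine the two factors with h(a, b) of Eq. (14).
    h = np.ones_like(X)
    both_up = (T1 > 1) & (T2 > 1)
    both_down = (T1 < 1) & (T2 < 1)
    h[both_up] = np.minimum(T1, T2)[both_up]
    h[both_down] = np.maximum(T1, T2)[both_down]
    X = X * h   # Eq. (13)
    return V, X
```

Iterating snmf_step until V and X stop changing reaches a stationary point, which Theorem 1 below identifies with a Pareto minimum.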

SNMF can be easily adopted for other NMF variants; e.g., polynomial kernel NMF (as described in “Non-negative Matrix Factorization in Polynomial Feature Space”, IEEE Trans. on Neural Networks, 19(6):1090-1100, 2008, by I. Buciu, N. Nikolaidis, and I. Pitas) and kernel NMF (as described in “Nonlinear Nonnegative Component Analysis”, CVPR, pages 2860-2865, 2009, by S. Zafeiriou and M. Petrou). To incorporate other NMF variants with graph embedding, one can simply combine the variant's updates of the coefficient matrix with T_(ij)⁽²⁾ in Eq. (10), which is the multiplicative update factor of graph embedding, as in Eq. (13).

Now the topic of convergence analysis is discussed. First it may be noted, however, that in their seminal work, Lee and Seung, mentioned above, did not prove that the limit point obtained by the multiplicative updates is a local optimum. This convergence issue has been analyzed and resolved by Lin, as described in his paper, “On the Convergence of Multiplicative Update Algorithms For Non-negative Matrix Factorization”, IEEE Transactions on Neural Networks, 18:1589-1596, 2007, which is herein incorporated in its entirety by reference. Lin achieves this by adding a small non-zero quantity to the denominators of the updates. If desired, the limit point obtained by the updates of SNMF in accord with the present invention can be shown to be a Pareto minimum by the same method.

Returning to the subject of convergence analysis, it is helpful to first reintroduce the concept of the auxiliary function, which is used by numerous researchers in the community, and to provide a few definitions, lemmas, proofs, and theorems, as follows:

-   Definition 3: G(x, x′) is an auxiliary function for F(x) if the following two conditions are satisfied:

    G(x,x′) ≧ F(x), G(x,x) = F(x)  (15)

    This definition is useful with the following lemma.

-   Lemma 1: If G(x, x′) is an auxiliary function, then F(x) is non-increasing under an update from x^(t) to x^(t+1) chosen such that G(x^(t+1), x^(t)) ≦ G(x^(t), x^(t)).

-   Proof: F(x^(t+1)) ≦ G(x^(t+1), x^(t)) ≦ G(x^(t), x^(t)) = F(x^(t))

Since only F⁽¹⁾ includes basis matrix V, a Pareto improvement on F⁽¹⁾ and F⁽²⁾ can be accomplished by the update of V in Eq. (12) when X is fixed, unless a Pareto minimum is already achieved. Thus, the next step is to show that Eq. (13) also performs a Pareto improvement on X when V is fixed, unless a Pareto minimum is already achieved.

-   Lemma 2: F⁽¹⁾ is non-increasing under the following update rule when V is fixed:

    X_(ij) ← X_(ij)·h(T_(ij)⁽¹⁾, a)  (16)

    where a is an arbitrary number. If X_(ij) is changed by this update, i.e., if X_(ij) ≠ 0 and h(T_(ij)⁽¹⁾, a) ≠ 1, then F⁽¹⁾ is decreasing.

-   Proof: The following update is considered first:

    X_(ij) ← X_(ij) T_(ij)⁽¹⁾  (17)

    This is the update of normal NMF. Since F⁽¹⁾ is the objective of normal NMF, F⁽¹⁾ is non-increasing under this update. And, if X_(ij) ≠ 0 and T_(ij)⁽¹⁾ ≠ 1, then F⁽¹⁾ is decreasing.

Let X^((ij))(x) be the matrix X whose (i, j) element is replaced with x. Then, for any 0 ≦ c ≦ 1,

$\begin{matrix}\begin{matrix}{{F^{(1)}\left( {V,X} \right)} = {{\left( {1 - c} \right){F^{(1)}\left( {V,X} \right)}} + {{cF}^{(1)}\left( {V,X} \right)}}} \\{\geq {{\left( {1 - c} \right){F^{(1)}\left( {V,X} \right)}} + {{cF}^{(1)}\left( {V,{X^{({ij})}\left( {X_{ij}T_{ij}^{(1)}} \right)}} \right)}}}\end{matrix} & (18)\end{matrix}$

Since F⁽¹⁾(V, X) is a convex function of X_(ij),

$\begin{matrix}\begin{matrix}{{F^{(1)}\left( {V,X} \right)} \geq {F^{(1)}\left( {V,{{\left( {1 - c} \right)X} + {{cX}^{({ij})}\left( {X_{ij}T_{ij}^{(1)}} \right)}}} \right)}} \\{= {F^{(1)}\left( {V,{X^{({ij})}\left( {{\left( {1 - c} \right)X_{ij}} + {{cX}_{ij}T_{ij}^{(1)}}} \right)}} \right)}} \\{= {F^{(1)}\left( {V,{X^{({ij})}\left( {X_{ij}\left( {1 - c + {cT}_{ij}^{(1)}} \right)} \right)}} \right)}}\end{matrix} & (19)\end{matrix}$

Let z = 1−c+cT_(ij)⁽¹⁾. If T_(ij)⁽¹⁾ > 1, z can be any value between 1 and T_(ij)⁽¹⁾. Similarly, if T_(ij)⁽¹⁾ < 1, z can be any value between T_(ij)⁽¹⁾ and 1. If T_(ij)⁽¹⁾ = 1, then z = 1. Therefore, F⁽¹⁾ is non-increasing under the update rule

X_(ij) ← X_(ij) z  (20)

for any z such that

$\begin{matrix}\left\{ \begin{matrix}{1 \leq z \leq T_{ij}^{(1)}} & {{{if}\mspace{14mu} T_{ij}^{(1)}} > 1} \\{T_{ij}^{(1)} \leq z \leq 1} & {{{if}\mspace{14mu} T_{ij}^{(1)}} < 1} \\1 & {otherwise}\end{matrix} \right. & (21)\end{matrix}$

This update rule is equivalent to the update in Eq. (16).

Both the equalities in Eqs. (18) and (19) hold only if X_(ij) = 0 or T_(ij)⁽¹⁾ = 1 or c = 0. Thus, if X_(ij) ≠ 0 and z ≠ 1, which is equivalent to h(T_(ij)⁽¹⁾, a) ≠ 1, the inequalities are strict and F⁽¹⁾ is decreasing.

Before moving on to F⁽²⁾, an idea of where T_(ij)⁽²⁾ in Eq. (10) is derived from is provided. To do so, one defines

K(X) = Tr(XLX^(T)) − λTr(XL̄X^(T))  (22)

which is obtained by transforming F⁽²⁾ into the difference form. This K plays an auxiliary role in the proof on F⁽²⁾. For now, suppose that λ is an arbitrary number.

In order to integrate the non-negative constraints into K, one can set Φ = [Φ_(ij)] as a Lagrange multiplier matrix, in which Φ_(ij) is the Lagrange multiplier for the constraint X_(ij) ≧ 0. Then the Lagrangian ℒ for K is defined as

ℒ = Tr(XLX^(T)) − λTr(XL̄X^(T)) + Tr(ΦX^(T))  (23)

By setting the derivative of ℒ to zero, one obtains

$\begin{matrix}{\frac{\partial\mathcal{L}}{\partial X} = {{{2{XL}} - {2\lambda\; X\overset{\_}{L}} + \Phi} = 0}} & (24)\end{matrix}$

Along with the KKT condition Φ_(ij)X_(ij) = 0, described in “Nonlinear programming”, Proceedings of 2nd Berkeley Symposium, 1951, by H. Kuhn and A. Tucker, and herein incorporated in its entirety by reference, this gives

2(XL)_(ij)X_(ij) − 2λ(XL̄)_(ij)X_(ij) + Φ_(ij)X_(ij) = 2(X(L−λL̄)⁺)_(ij)X_(ij) − 2(X(L−λL̄)⁻)_(ij)X_(ij) = 0  (25)

From this equation, one can obtain the following update:

$\begin{matrix}{\left. X_{ij}\leftarrow{X_{ij}\frac{\left( {X\left( {L - {\lambda\;\overset{\_}{L}}} \right)}^{-} \right)_{ij}}{\left( {X\left( {L - {\lambda\overset{\_}{\; L}}} \right)}^{+} \right)_{ij}}} \right. = {X_{ij}T_{ij}^{(2)}}} & (26)\end{matrix}$

-   Lemma 3: K is non-increasing under the following update rule:

    X_(ij) ← X_(ij) T_(ij)⁽²⁾  (27)

-   Proof: The first order and second order derivatives of K with respect to X_(ij) are respectively computed as

$\begin{matrix}{{\frac{\partial K}{\partial X_{ij}} = {2\left( {X\left( {L - {\lambda\overset{\_}{L}}} \right)} \right)_{ij}}},{\frac{\partial^{2}K}{\partial X_{ij}^{2}} = {2\left( {L - {\lambda\overset{\_}{L}}} \right)_{jj}}}} & (28)\end{matrix}$

Let K^((ij)) be the function obtained by isolating the X_(ij) term from K. Then, one can define G as an auxiliary function of K^((ij)) by replacing the second order derivative in the Taylor series expansion of K^((ij)):

$\begin{matrix}{{G\left( {X_{ij},X_{ij}^{t}} \right)} = \left. {{K^{({ij})}\left( X_{ij}^{t} \right)} + \frac{\partial K}{\partial X_{ij}}} \middle| {}_{X_{ij} = X_{ij}^{t}}{\left( {X_{ij} - X_{ij}^{t}} \right) + {\frac{\left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}{X_{ij}^{t}}\left( {X_{ij} - X_{ij}^{t}} \right)^{2}}} \right.} & (29)\end{matrix}$

To verify that G is an auxiliary function of K, one needs to show

$\begin{matrix}{{{G\left( {X_{ij},X_{ij}^{t}} \right)} - {K^{({ij})}\left( X_{ij} \right)}} = {{\left( {\frac{\left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}{X_{ij}^{t}} - \left( {L - {\lambda\overset{\_}{L}}} \right)_{jj}} \right)\left( {X_{ij} - X_{ij}^{t}} \right)^{2}} \geq 0}} & (30)\end{matrix}$

which is equivalent to

(X^(t)(L−λL̄)⁺)_(ij) − X_(ij)^(t)(L−λL̄)_(jj) ≧ 0  (31)

This inequality is satisfied because

(X^(t)(L−λL̄)⁺)_(ij) = Σ_(k=1)^(n) X_(ik)^(t)(L−λL̄)_(kj)⁺ ≧ X_(ij)^(t)(L−λL̄)_(jj)⁺ ≧ X_(ij)^(t)(L−λL̄)_(jj)  (32)

G(X_(ij), X_(ij)^(t)) is convex; thus, solving

$\frac{\partial{G\left( {X_{ij},X_{ij}^{t}} \right)}}{\partial X_{ij}} = 0$

yields X_(ij)^(t+1) by Lemma 1:

$\begin{matrix}\begin{matrix}{X_{ij}^{t + 1} = {X_{ij}^{t}\frac{\left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij} - \left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)} \right)_{ij}}{\left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}}} \\{= {{X_{ij}^{t}\frac{\left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)}^{-} \right)_{ij}}{\left( {X^{t}\left( {L - {\lambda\overset{\_}{L}}} \right)}^{+} \right)_{ij}}} = {X_{ij}^{t}T_{ij}^{(2)}}}}\end{matrix} & (33)\end{matrix}$

Since G is an auxiliary function of K, K is also non-increasing under this update.

-   Lemma 4: K is non-increasing under the following update rule:

    X_(ij) ← X_(ij)·h(T_(ij)⁽²⁾, b)  (34)

    where b is an arbitrary number. If X_(ij) is changed by this update, i.e., if X_(ij) ≠ 0 and h(T_(ij)⁽²⁾, b) ≠ 1, then K is decreasing.

-   Proof: Let X_(ij)^(t+1) = X_(ij)^(t) T_(ij)⁽²⁾. Then, for any 0 ≦ c ≦ 1,

    G(X_(ij)^(t), X_(ij)^(t)) = (1−c)G(X_(ij)^(t), X_(ij)^(t)) + cG(X_(ij)^(t), X_(ij)^(t)) ≧ (1−c)G(X_(ij)^(t), X_(ij)^(t)) + cG(X_(ij)^(t+1), X_(ij)^(t))  (35)

    Since G(X_(ij), X_(ij)^(t)) is convex,

$\begin{matrix}\begin{matrix}{{G\left( {X_{ij}^{t},X_{ij}^{t}} \right)} \geq {G\left( {{{\left( {1 - c} \right)X_{ij}^{t}} + {cX}_{ij}^{t + 1}},X_{ij}^{t}} \right)}} \\{= {G\left( {{{\left( {1 - c} \right)X_{ij}^{t}} + {{cX}_{ij}^{t}T_{ij}^{(2)}}},X_{ij}^{t}} \right)}} \\{= {G\left( {{X_{ij}^{t}\left( {1 - c + {cT}_{ij}^{(2)}} \right)},X_{ij}^{t}} \right)}}\end{matrix} & (36)\end{matrix}$

Let z = 1−c+cT_(ij)⁽²⁾. In the same way as in Lemma 2, one can show that G(X_(ij), X_(ij)^(t)) is non-increasing under the update rule

X_(ij) ← X_(ij) z  (37)

for any z such that

$\begin{matrix}\left\{ \begin{matrix}{1 \leq z \leq T_{ij}^{(2)}} & {{{if}\mspace{14mu} T_{ij}^{(2)}} > 1} \\{T_{ij}^{(2)} \leq z \leq 1} & {{{if}\mspace{14mu} T_{ij}^{(2)}} < 1} \\1 & {otherwise}\end{matrix} \right. & (38)\end{matrix}$

This update rule is equivalent to the update in Eq. (34). Since G is an auxiliary function of K, K is also non-increasing under the update in Eq. (34) by Lemma 1. In the same way as in Lemma 2, one can show that the inequalities in Eqs. (35) and (36) are strict if X_(ij) ≠ 0 and h(T_(ij)⁽²⁾, b) ≠ 1.

-   Lemma 5: F⁽²⁾ is non-increasing under the update in Eq. (34) if

$\begin{matrix}{\lambda = {{F^{(2)}\left( X^{t} \right)} = \frac{{Tr}\left( {X^{t}{LX}^{t^{T}}} \right)}{{Tr}\left( {X^{t}\overset{\_}{L}X^{t^{T}}} \right)}}} & (39)\end{matrix}$

at iteration t. If X_(ij) is changed under the update, then F⁽²⁾ is decreasing.

Proof: Suppose that X_(ij)^(t+1) is obtained from X_(ij)^(t) by applying the update in Eq. (34); i.e., X_(ij)^(t+1) = X_(ij)^(t)·h(T_(ij)⁽²⁾, b). Since K(X) is non-increasing under the update in Eq. (34),

Tr(X^(t+1)LX^(t+1)^(T)) − λTr(X^(t+1)L̄X^(t+1)^(T)) = K(X^(t+1)) ≦ K(X^(t)) = 0  (40)

One ensures that Tr(XL̄X^(T)) is always positive; thus,

$\begin{matrix}{{F^{(2)}\left( X^{t + 1} \right)} = {{\frac{{Tr}\left( {X^{t + 1}{LX}^{t + 1^{T}}} \right)}{{Tr}\left( {X^{t + 1}\overset{\_}{L}X^{t + 1^{T}}} \right)} \leq \lambda} = {F^{(2)}\left( X^{t} \right)}}} & (41)\end{matrix}$

If X_(ij) is changed, the inequality in Eq. (40) is strict, so that the inequality in Eq. (41) is also strict.

-   Theorem 1: A Pareto improvement on F⁽¹⁾ and F⁽²⁾ can be achieved by the update rules in Eqs. (10) to (14) unless a Pareto minimum is already achieved.

Proof: By Lemmas 2 and 5, either F⁽¹⁾ or F⁽²⁾ is decreasing under the update rules in Eqs. (10) to (14) unless X and V are at a stationary point, which is a Pareto minimum of F⁽¹⁾ and F⁽²⁾.

The above described method of SNMF, which, as is illustrated below, is well suited for data classification, may be implemented in various types of data processing hardware.

With reference to FIG. 3, a general example of such data processing hardware includes a data processing device 11. As is known in the art, data processing device 11 may be a micro-computer, a central processing unit (CPU), a specialized image processor, a programmable logic device (PLD), a complex programmable logic device (CPLD), an application specific integrated circuit (ASIC), or other computing device. In general, data processing device 11 may include an arithmetic logic unit (ALU) or CPU 13, control logic 15, various timing clocks 17, various types of registers 19 (including data registers, shift registers, workspace registers, status registers, address registers, interrupt registers, instruction registers, program counters, etc.), and a memory unit 21 (including RAM and/or ROM).

In the present example of FIG. 3, raw data matrix U of n samples, which may consist of training data when used for data classification or categorization, may be maintained in a data store 23. Data processing device 11 may directly access data store 23 via a direct link 32 and an appropriate input/output interface 27, or may alternatively access data store 23 via communication links 31/33 and network 29, which may be a LAN, WLAN, or the Internet.

Similarly, test data 37, which is the data that is to be classified, may be accessible via a direct link 34 or through communication network 29 and communication links 31/35. It is to be understood that test data 37 may be an archive of data (such as a store of face images) or may be generated in real time (such as face images created by surveillance cameras). It is further to be understood that communication links 31-35 may be wired or wireless communication links.

In the following section, the presently preferred embodiment is evaluated using three standard face databases in public use in the field: the FERET database, the JAFFE database, and the AR database.

The FERET database contains 420 images of 70 people. For each subject person, six frontal-view images are provided.

The JAFFE database contains 213 images of 10 Japanese female subjects. For each subject, 3 or 4 samples of each of 7 basic facial expressions are provided.

The AR database contains 4000 images of 126 people, including 70 men and 56 women, and provides frontal-view images with different facial expressions, illumination conditions, and natural occlusions for each person.

For evaluation purposes, once the face region is cropped, each image is down-sampled to 32×32 pixels for the FERET and AR databases, and down-sampled to 40×30 pixels for the JAFFE database. Following the typical approach of previous works, three images from each person in the FERET database, 150 images from the JAFFE database, and seven images from each person in the AR database are randomly selected as a training set (i.e. training data), and the rest are utilized as a test set (i.e. test data).

To test the effectiveness of the present SNMF approach, SNMF is compared with seven other popular subspace learning algorithms: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Nonnegative Matrix Factorization (NMF), Localized NMF (LNMF), Linear Discriminant Analysis (LDA), Marginal Fisher Analysis (MFA), and Nonnegative Graph Embedding (NGE). For NGE, the multiplicative updates proposed in “Multiplicative Nonnegative Graph Embedding”, CVPR, 2009, by Wang et al. are implemented, and the parameter for balancing the NMF part and the graph embedding part is set to 1, as is suggested in “Non-negative Graph Embedding”, CVPR, 2008, by Yang et al.

For MFA and NGE, the protocol described in “Marginal Fisher Analysis and Its Variants For Human Gait Recognition and Content-based Image Retrieval”, IEEE Trans. on Image Processing, 16(11), 2007, by Xu et al. is followed to build the intrinsic graph and the penalty graph.

For classification, as is conventional, the 1-Nearest Neighbor classifier (1-NN) is used because of its computational efficiency. After testing 10 times, the mean and the standard deviation of the classification accuracies are reported. A sufficient number of dimensions that nearly cover all possible dimensions of the embedding space are explored, and the best performance is reported.
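The evaluation pipeline can be sketched as follows. The text does not spell out how a test sample is projected onto the learned basis; a common choice, assumed here, is to fix V* and iterate the coefficient update of Eq. (2) for the test sample before applying 1-NN in feature space. All names below are illustrative.

```python
import numpy as np

def project(u, V, n_iter=100, eps=1e-12, seed=0):
    """Encode a test sample u (d,) against a fixed basis V (d, r) by
    iterating the X update of Eq. (2). This projection step is an
    assumption; the text only specifies 1-NN on the learned features."""
    rng = np.random.default_rng(seed)
    x = rng.random((V.shape[1], 1))
    u = u.reshape(-1, 1)
    for _ in range(n_iter):
        x *= (V.T @ u) / (V.T @ V @ x + eps)
    return x.ravel()

def classify_1nn(u_test, V_star, X_star, train_labels):
    """1-Nearest Neighbor classification in the SNMF feature space."""
    x = project(u_test, V_star)
    dists = np.linalg.norm(X_star - x[:, None], axis=0)
    return train_labels[int(np.argmin(dists))]
```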

With reference to Table 1 in FIG. 4, the classification capability of the present SNMF approach is compared with the above-noted seven subspace learning algorithms, as applied to the FERET and JAFFE databases. Performance for SNMF exceeds both the classical dimension reduction methods (PCA and ICA) and the NMF variants (NMF, LNMF, and NGE). In particular, SNMF outperforms NGE although the two approaches are developed from a similar idea, which indicates that the formulation of SNMF generates more discriminative features than NGE.

The superior performance of SNMF as compared to other supervised methods, namely LDA and MFA, can be attributed to the fact that SNMF minimizes the reconstruction error while utilizing label information. In this respect, SNMF has the advantages of both unsupervised representation approaches and supervised classification approaches. LDA performs comparably to the present SNMF method on the FERET and JAFFE databases; however, LDA is not robust to variations, as shown below. MFA also reports slightly less accuracy than SNMF on JAFFE, but the performance gap between the two approaches is significant on the FERET database. MFA tends to over-reduce the dimension of data during its preprocessing step to avoid singular value issues when the number of training samples is small relative to the number of classes. Compared to LDA and MFA, the present approach reliably demonstrates the best performance.

In another exemplary application of the present invention, synthetic patches (i.e. occlusion regions that hide a portion of an image) are placed on the original images from the FERET and JAFFE databases. The patch pixel sizes are one of 5×5, 10×10, and 15×15, and the patch locations are randomly chosen. For each of the patch sizes, recognition accuracies over 10 tests were computed.

For example, FIG. 5 shows four sample images 41-44 from the FERET database, with three of the four images having synthetic occlusions. The leftmost image 41 is the original image. From second left to right, the occlusion sizes are 5×5 pixels for image 42, 10×10 pixels for image 43, and 15×15 pixels for image 44. The locations of the occlusions are randomly selected.

The results for the FERET database are shown in the plot of FIG. 6A, and the results for the JAFFE database are shown in the plot of FIG. 6B. As can be seen, SNMF consistently dominates the other approaches. The greater the occlusion sizes are, the clearer the performance gaps between SNMF and the other methods become. Of particular interest is the performance of LDA, which is comparable to SNMF in the case of no occlusion, but drops drastically with growing occlusion size.

Although SNMF is robust to synthetic occlusions, this result does not necessarily imply that the algorithm is robust to natural variations in real world situations. The AR database is used to investigate how SNMF performs with natural variations. The AR database is selected since it contains various real world variations, such as different facial expressions, illumination conditions, and natural occlusions. Sample images from the AR database are illustrated in FIG. 7.

As is illustrated in the sample images of the AR database shown in FIG. 7, the AR database provides a reference subject (51) and also provides different facial expressions (52), illumination conditions (53), and natural occlusions (54 and 55) of the same subject.

Face recognition accuracies (%) on the AR database are shown in Table 2 of FIG. 8. After testing 10 times, the mean and the standard deviation of the 10 accuracies are reported.

As can be seen in Table 2, SNMF outperforms the other techniques on the AR database as well, which implies that SNMF can be successfully applied to real world problems. LDA performs much worse than SNMF on the AR database due to the large variations within the same class. SNMF clearly outperforms NGE as well. Based on the experimental results, it is put forth that the complementary space idea does not effectively maximize separability of the penalty graph when within-class variation is large, as is evident in the AR database. In such a case, NGE jumbles up features from different classes while minimizing compactness of the intrinsic graph.

It is further put forth that SNMF shows more discriminating power and robustness to variations because SNMF not only produces generally sparse bases, but also ignores uninformative details based on class labels. In this respect, the bases of SNMF are meaningfully sparse.

For illustrative purposes, FIG. 9 shows the basis images of SNMF (61), NMF (62), LNMF (63), and NGE (64) for the AR database. The basis images of SNMF 61 describe fewer details than those of NMF 62 and NGE 64 since SNMF effectively excludes meaningless details. Although the basis images of LNMF 63 are sparser, they do not contain meaningful local components.

SNMF also has advantages in localization. In the basis images of SNMF 61, one can more clearly see each face component and variation: e.g., eyes vs. mouth regions, sunglasses, scarves, and illumination changes. SNMF automatically assigns more basis images to describe more informative parts by minimizing both the reconstruction error and the noise/signal ratio.

In this work, the concept of SNMF, which generates a descriptive part-based representation of data based on the concept of NMF aided by the discriminative idea of graph embedding, is proposed. An iterative procedure that optimizes the suggested formulation based on Pareto optimization is presented. The formulation presented in this work removes the dependence on combined optimization schemes, which can be divergent without proper parameter tuning. The analytical and empirical evidence shows that SNMF has advantages over popular subspace learning techniques as well as the state-of-the-art in the relevant field. We believe that the idea of SNMF can be successfully applied to other application domains, such as spam filtering and gene expression discovery. We leave the experimental validation as future work. Another future direction is to develop faster SNMF algorithms based on recently proposed faster NMF algorithms.

While the invention has been described in conjunction with several specific embodiments, it is evident to those skilled in the art that many further alternatives, modifications and variations will be apparent in light of the foregoing description. Thus, the invention described herein is intended to embrace all such alternatives, modifications, applications and variations as may fall within the spirit and scope of the appended claims.

What is claimed is:
 1. A method of factorizing a data matrix U by supervised nonnegative factorization, SNMF, comprising: providing a data processing device to implement the following steps: accessing said data matrix U from a data store, wherein data matrix U is defined as $U \in \mathbb{R}^{d \times n}$; defining an intrinsic graph G, wherein $G = \{U, W\}$, each column of U represents a vertex, and each element of similarity matrix W measures the similarity between vertex pairs; defining a penalty graph $\bar{G}$, wherein $\bar{G} = \{U, \bar{W}\}$ and each element of dissimilarity matrix $\bar{W}$ measures unfavorable relationships between said vertex pairs; defining an intrinsic diagonal matrix D, wherein $D = [D_{ij}]$ and $D_{ii} = \sum_{j=1}^{n} W_{ij}$; defining an intrinsic Laplacian matrix L, wherein $L = D - W$; defining a penalty diagonal matrix $\bar{D}$, wherein $\bar{D} = [\bar{D}_{ij}]$ and $\bar{D}_{ii} = \sum_{j=1}^{n} \bar{W}_{ij}$; defining a penalty Laplacian matrix $\bar{L}$, wherein $\bar{L} = \bar{D} - \bar{W}$; defining a basis matrix V, wherein $V \in \mathbb{R}^{d \times r}$; defining a feature matrix X, wherein $X \in \mathbb{R}^{r \times n}$; defining a measure of the compactness of intrinsic graph G as the weighted sum of squared distances $\sum_{i<j}^{n} W_{ij}\,\|x_i - x_j\|^2 = \mathrm{Tr}(X L X^T)$, wherein $x_i$ is the i-th column of X and $x_j$ is the j-th column of X; defining a measure of the separability of penalty graph $\bar{G}$ as the weighted sum of squared distances $\sum_{i<j}^{n} \bar{W}_{ij}\,\|x_i - x_j\|^2 = \mathrm{Tr}(X \bar{L} X^T)$; defining $F^{(1)}(V,X)$ as an objective of NMF (nonnegative matrix factorization), wherein $F^{(1)}(V,X) = \|U - VX\|_F^2$; defining $F^{(2)}(X)$ as an objective of graph embedding, wherein $F^{(2)}(X) = \frac{\mathrm{Tr}(X L X^T)}{\mathrm{Tr}(X \bar{L} X^T)}$; applying Pareto optimality to $F^{(1)}(V,X)$ and $F^{(2)}(X)$; and defining the final state (V*, X*) of matrices V and X at the Pareto optimal resulting from application of said Pareto optimality as a factorization of data matrix U.
 2. The method of claim 1, wherein data matrix U is comprised of n samples and each column of U represents a sample.
 3. The method of claim 2, wherein each of said samples is an image file.
 4. The method of claim 1, wherein W and $\bar{W}$ are generated from true relationships among data pairs.
 5. The method of claim 4, wherein said data pairs are class labels of data.
 6. The method of claim 1, wherein each column of feature matrix X is a low-dimensional representation of the corresponding column of U.
 7. The method of claim 1, wherein the ratio formulation of $F^{(2)}(X)$ is handled without any transformation.
 8. The method of claim 1, wherein at least one of similarity matrix W or dissimilarity matrix $\bar{W}$ has negative values.
 9. The method of claim 8, wherein $\mathrm{Tr}(X L X^T)$ and $\mathrm{Tr}(X \bar{L} X^T)$ are positive.
 10. The method of claim 1, wherein said Pareto optimality is applied directly on said ratio formulation of $F^{(2)}(X)$ in the absence of any weighted-sum approximation.
 11. The method of claim 1, wherein said Pareto optimality is applied through a series of Pareto-improvement status update iterations, each defined as a change from a current status (V,X) to a new status (V′,X′) that achieves a Pareto improvement, until said Pareto optimal is achieved; a status update is a Pareto improvement if either of the following two conditions is satisfied:
1) $F^{(1)}(V',X') < F^{(1)}(V,X)$ and $F^{(2)}(X') \leq F^{(2)}(X)$
2) $F^{(1)}(V',X') \leq F^{(1)}(V,X)$ and $F^{(2)}(X') < F^{(2)}(X)$
and wherein a current status is a Pareto optimal (V*,X*) if there is no other status (V′,X′) such that a status update iteration from (V*,X*) to (V′,X′) is a Pareto improvement.
 12. The method of claim 1, wherein said Pareto optimality is applied to $F^{(1)}(V,X)$ and to $F^{(2)}(X)$ through iterative multiplicative updates until said Pareto optimal is achieved; letting $\lambda = F^{(2)}(X)$; letting
$$T^{(1)}_{ij} = \frac{(V^T U)_{ij}}{(V^T V X)_{ij}};$$
letting
$$T^{(2)}_{ij} = \frac{\left(X (L - \lambda \bar{L})^{-}\right)_{ij}}{\left(X (L - \lambda \bar{L})^{+}\right)_{ij}};$$
letting matrix $(L - \lambda \bar{L})^{+}$ be defined as $(L - \lambda \bar{L})^{+} = A^{+} = [A^{+}_{ij}]$ and
$$A^{+}_{ij} = \begin{cases} A_{ij} & \text{if } A_{ij} > 0 \\ 0 & \text{otherwise}; \end{cases}$$
letting matrix $(L - \lambda \bar{L})^{-}$ be defined as $(L - \lambda \bar{L})^{-} = A^{-} = [A^{-}_{ij}]$ and
$$A^{-}_{ij} = \begin{cases} -A_{ij} & \text{if } A_{ij} < 0 \\ 0 & \text{otherwise}; \end{cases}$$
and wherein said multiplicative updates are defined as:
$$V_{ij} \leftarrow V_{ij}\,\frac{(U X^T)_{ij}}{(V X X^T)_{ij}}$$
$$X_{ij} \leftarrow X_{ij} \cdot h\left(T^{(1)}_{ij}, T^{(2)}_{ij}\right)$$
where
$$h(a,b) = \begin{cases} \min(a,b) & \text{if } a > 1 \text{ and } b > 1 \\ \max(a,b) & \text{if } a < 1 \text{ and } b < 1 \\ 1 & \text{otherwise}. \end{cases}$$
 13. The method of claim 12, wherein said Pareto optimal is achieved when said multiplicative updates reach a stationary point.
 14. The method of claim 1, wherein similarity matrix W and dissimilarity matrix $\bar{W}$ are defined by the concept of within-class and between-class distances of Linear Discriminant Analysis (LDA).
 15. The method of claim 14, wherein: similarity matrix $W = [W_{ij}]$ is defined as
$$W_{ij} = \begin{cases} \frac{1}{n_c} & \text{if } y_i, y_j \in c \\ 0 & \text{otherwise} \end{cases}$$
wherein $y_i$ is the class label of the i-th sample, $y_j$ is the class label of the j-th sample, and $n_c$ is the size of class c; and dissimilarity matrix $\bar{W} = [\bar{W}_{ij}]$ is defined as
$$\bar{W}_{ij} = \frac{1}{n} - W_{ij}$$
wherein n is the number of data points.
 16. A method of classifying test data, comprising: arranging a set of training data into a data matrix U; applying the supervised nonnegative factorization method of claim 1 to data matrix U to identify the Pareto optimal state V* and X* of factorizing matrices V and X; and classifying said test data only according to the classification defined by X*.
 17. A data classification system for classifying test data, comprising: a data processing device with access to a data matrix U of training data and with access to said test data, said data matrix U being defined as $U \in \mathbb{R}^{d \times n}$; wherein an intrinsic graph G is defined as $G = \{U, W\}$, each column of U representing a vertex and each element of similarity matrix W measuring the similarity between vertex pairs; a penalty graph $\bar{G}$ is defined as $\bar{G} = \{U, \bar{W}\}$ and each element of dissimilarity matrix $\bar{W}$ measures unfavorable relationships between said vertex pairs; an intrinsic diagonal matrix D is defined as $D = [D_{ij}]$ and $D_{ii} = \sum_{j=1}^{n} W_{ij}$; an intrinsic Laplacian matrix L is defined as $L = D - W$; a penalty diagonal matrix $\bar{D}$ is defined as $\bar{D} = [\bar{D}_{ij}]$ and $\bar{D}_{ii} = \sum_{j=1}^{n} \bar{W}_{ij}$; a penalty Laplacian matrix $\bar{L}$ is defined as $\bar{L} = \bar{D} - \bar{W}$; a basis matrix V is defined as $V \in \mathbb{R}^{d \times r}$; a feature matrix X is defined as $X \in \mathbb{R}^{r \times n}$; a measure of the compactness of intrinsic graph G is defined by the weighted sum of squared distances $\sum_{i<j}^{n} W_{ij}\,\|x_i - x_j\|^2 = \mathrm{Tr}(X L X^T)$, wherein $x_i$ is the i-th column of X and $x_j$ is the j-th column of X; a measure of the separability of penalty graph $\bar{G}$ is defined by the weighted sum of squared distances $\sum_{i<j}^{n} \bar{W}_{ij}\,\|x_i - x_j\|^2 = \mathrm{Tr}(X \bar{L} X^T)$; $F^{(1)}(V,X)$ defines an objective of NMF (nonnegative matrix factorization), wherein $F^{(1)}(V,X) = \|U - VX\|_F^2$; $F^{(2)}(X)$ defines an objective of graph embedding, wherein $F^{(2)}(X) = \frac{\mathrm{Tr}(X L X^T)}{\mathrm{Tr}(X \bar{L} X^T)}$; and said data processing device applies Pareto optimality to $F^{(1)}(V,X)$ and $F^{(2)}(X)$, defines the final state (V*, X*) of matrices V and X at the Pareto optimal resulting from application of said Pareto optimality as a factorization of data matrix U, and classifies said test data according to the classification defined by X*.
 18. The system of claim 17, wherein data matrix U is comprised of n samples and each column of U represents a sample.
 19. The system of claim 18, wherein each of said samples is an image file.
 20. The system of claim 17, wherein said data pairs are class labels of data.
 21. The system of claim 17, wherein each column of feature matrix X is a low-dimensional representation of the corresponding column of U.
 22. The system of claim 17, wherein the ratio formulation of $F^{(2)}(X)$ is handled without any transformation.
 23. The system of claim 17, wherein at least one of similarity matrix W or dissimilarity matrix $\bar{W}$ has negative values.
 24. The system of claim 17, wherein said Pareto optimality is applied directly on said ratio formulation of $F^{(2)}(X)$ in the absence of any weighted-sum approximation.
 25. The system of claim 17, wherein said Pareto optimality is applied through a series of Pareto-improvement status update iterations, each defined as a change from a current status (V,X) to a new status (V′,X′) that achieves a Pareto improvement, until said Pareto optimal is achieved; a status update is a Pareto improvement if either of the following two conditions is satisfied:
1) $F^{(1)}(V',X') < F^{(1)}(V,X)$ and $F^{(2)}(X') \leq F^{(2)}(X)$
2) $F^{(1)}(V',X') \leq F^{(1)}(V,X)$ and $F^{(2)}(X') < F^{(2)}(X)$
and wherein a current status is a Pareto optimal (V*,X*) if there is no other status (V′,X′) such that a status update iteration from (V*,X*) to (V′,X′) is a Pareto improvement.
 26. The system of claim 17, wherein said Pareto optimality is applied to $F^{(1)}(V,X)$ and to $F^{(2)}(X)$ through iterative multiplicative updates until said Pareto optimal is achieved; letting $\lambda = F^{(2)}(X)$; letting
$$T^{(1)}_{ij} = \frac{(V^T U)_{ij}}{(V^T V X)_{ij}};$$
letting
$$T^{(2)}_{ij} = \frac{\left(X (L - \lambda \bar{L})^{-}\right)_{ij}}{\left(X (L - \lambda \bar{L})^{+}\right)_{ij}};$$
letting matrix $(L - \lambda \bar{L})^{+}$ be defined as $(L - \lambda \bar{L})^{+} = A^{+} = [A^{+}_{ij}]$ and
$$A^{+}_{ij} = \begin{cases} A_{ij} & \text{if } A_{ij} > 0 \\ 0 & \text{otherwise}; \end{cases}$$
letting matrix $(L - \lambda \bar{L})^{-}$ be defined as $(L - \lambda \bar{L})^{-} = A^{-} = [A^{-}_{ij}]$ and
$$A^{-}_{ij} = \begin{cases} -A_{ij} & \text{if } A_{ij} < 0 \\ 0 & \text{otherwise}; \end{cases}$$
and wherein said multiplicative updates are defined as:
$$V_{ij} \leftarrow V_{ij}\,\frac{(U X^T)_{ij}}{(V X X^T)_{ij}}$$
$$X_{ij} \leftarrow X_{ij} \cdot h\left(T^{(1)}_{ij}, T^{(2)}_{ij}\right)$$
where
$$h(a,b) = \begin{cases} \min(a,b) & \text{if } a > 1 \text{ and } b > 1 \\ \max(a,b) & \text{if } a < 1 \text{ and } b < 1 \\ 1 & \text{otherwise}. \end{cases}$$
 27. The system of claim 26, wherein said Pareto optimal is achieved when said multiplicative updates reach a stationary point.
 28. The system of claim 17, wherein: similarity matrix $W = [W_{ij}]$ is defined as
$$W_{ij} = \begin{cases} \frac{1}{n_c} & \text{if } y_i, y_j \in c \\ 0 & \text{otherwise} \end{cases}$$
wherein $y_i$ is the class label of the i-th sample and $n_c$ is the size of class c; and dissimilarity matrix $\bar{W} = [\bar{W}_{ij}]$ is defined as
$$\bar{W}_{ij} = \frac{1}{n} - W_{ij}$$
wherein n is the total number of data points.
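
For illustrative purposes, the following minimal NumPy sketch shows one way the similarity and dissimilarity matrices of claims 15 and 28, together with the Laplacian matrices of claims 1 and 17, might be constructed from class labels. The function name and the use of NumPy are expository assumptions, not part of the claimed subject matter.

```python
import numpy as np

def lda_style_graphs(labels):
    """Build W, W_bar and the Laplacians L, L_bar from class labels
    per the LDA-style definitions of claims 15/28 (illustrative only)."""
    labels = np.asarray(labels)
    n = labels.size

    # W_ij = 1/n_c when samples i and j belong to the same class c, else 0.
    same_class = labels[:, None] == labels[None, :]
    class_sizes = np.array([np.sum(labels == c) for c in labels])
    W = np.where(same_class, 1.0 / class_sizes[None, :], 0.0)

    # W_bar_ij = 1/n - W_ij; note same-class entries become nonpositive,
    # consistent with claims 8/23 (matrices may have negative values).
    W_bar = 1.0 / n - W

    # Laplacians: L = D - W with D_ii = sum_j W_ij, and likewise L_bar.
    L = np.diag(W.sum(axis=1)) - W
    L_bar = np.diag(W_bar.sum(axis=1)) - W_bar
    return W, W_bar, L, L_bar
```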
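Similarly, the following sketch illustrates one hypothetical realization of the classification step of claim 16. The claims do not prescribe how a test sample is mapped into the learned feature space; the nonnegative least-squares projection and the nearest-neighbor rule used here are assumptions chosen for concreteness, not the claimed method itself.

```python
import numpy as np
from scipy.optimize import nnls

def classify(u_test, V, X_train, train_labels):
    """Hypothetical nearest-neighbor classification in the SNMF feature
    space; V and X_train are the Pareto-optimal V* and X* of claim 1."""
    # Encode the test sample: nonnegative x minimizing ||u_test - V x||
    # (one common choice; other encodings are possible).
    x_test, _ = nnls(V, u_test)

    # Nearest neighbor among the training feature columns of X*.
    dists = np.linalg.norm(X_train - x_test[:, None], axis=0)
    return train_labels[int(np.argmin(dists))]
```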